support weight-update in disaggregated mode using sglang #1766

PengchengShi00 wants to merge 4 commits into
Conversation
@@ -1155,5 +1158,7 @@ async def _sync_weights_and_save(self, train_step: int, step_timer_dict: dict):
        self.fake_update_weights()

    def fake_update_weights(self):
Rename the function: if this is a real update-weight operation, please remove the `fake` prefix.
Updated in the latest commit: fake_update_weights has been renamed to update_weights.
REPO_ROOT = Path(__file__).resolve().parents[2]
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))
TEST_DIR = Path(__file__).resolve().parent
if str(TEST_DIR) not in sys.path:
    sys.path.insert(0, str(TEST_DIR))
This is unnecessary in the CI unit tests.
    def pause_generation(self):
-        return self._make_request("pause_generation")
+        return self._make_request("pause_generation", {"mode": "retract"})
Please add a comment explaining the extra parameter.
I have added a comment in the latest commit.
SGLang PauseGeneration supports three modes:
- `abort`: drop both waiting and running requests
- `retract`: keep waiting/running requests and generated tokens, release the KV cache, and recompute KV on resume
- `in_place`: keep waiting/running requests, generated tokens, and the KV cache, and resume directly
I also changed the mode to abort. Before update_weights, send_abort_request has already been issued, so there should be no pending requests to preserve. In this case, abort is sufficient and makes the intended behavior clearer.
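The mode choice described above can be sketched as follows. This is a minimal illustration, not the actual xtuner/SGLang client: the `RolloutClient` class and the body of `_make_request` are stand-ins (the real client issues HTTP requests to the SGLang server), but the mode semantics mirror the three modes listed above.

```python
class RolloutClient:
    """Illustrative stand-in for the rollout control client."""

    def __init__(self):
        self.sent = []  # record of (endpoint, payload) calls, for illustration

    def _make_request(self, endpoint, payload=None):
        # The real client would POST to the SGLang server endpoint here;
        # we only record the call so the sketch is self-contained.
        self.sent.append((endpoint, payload))
        return {"status": "ok"}

    def pause_generation(self, mode="abort"):
        # SGLang PauseGeneration accepts a "mode" field:
        #   abort    - drop waiting and running requests
        #   retract  - keep requests, release KV cache, recompute KV on resume
        #   in_place - keep requests, tokens, and KV cache; resume directly
        # "abort" is safe before update_weights because send_abort_request
        # has already cleared pending requests.
        return self._make_request("pause_generation", {"mode": mode})
```

Defaulting to `abort` makes the intended behavior explicit at the call site while still allowing `retract` or `in_place` to be passed where requests must survive the pause.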
ray.get([worker.reset_update_weight_sha256.remote() for worker in train_controller.workers])
train_controller.update_weights()
first_hashes = ray.get([worker.get_update_weight_sha256.remote() for worker in train_controller.workers])

ray.get([worker.reset_update_weight_sha256.remote() for worker in train_controller.workers])
train_controller.update_weights()
second_hashes = ray.get([worker.get_update_weight_sha256.remote() for worker in train_controller.workers])
This hash logic only verifies that the weight update is deterministic. The hash comparison should instead be between the state_dict sent by the training side and the state_dict received by the rollout side.
In the latest commit, I added per-bucket hash checks for both the training-side sent state_dict and the rollout-side received state_dict.
To support this comparison, I also needed a small SGLang-side change so the rollout side can return the received bucket hash. I have rebuilt the docker image with that SGLang patch applied.
In the latest commit, the unit tests now verify:
- The rollout output remains unchanged for the same input before and after the weight update.
- For each bucket, the training-side sent state_dict hash matches the rollout-side received state_dict hash.
- Across two consecutive weight updates, the training-side sent bucket state_dict hashes remain identical.
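The per-bucket check described above can be sketched as follows. This is a simplified illustration, not the actual implementation: `bucket_sha256` is a hypothetical helper, and parameters are assumed to be already serialized to bytes in a deterministic order (in the real code they are tensors grouped into broadcast buckets).

```python
import hashlib

def bucket_sha256(named_params, bucket_size):
    """Hash parameters bucket-by-bucket.

    named_params: iterable of (name, bytes) pairs in a deterministic order.
    Returns one hex digest per bucket, so the training side (before send)
    and the rollout side (after receive) can compare digests directly.
    """
    digests = []
    hasher = hashlib.sha256()
    filled = 0
    for name, raw in named_params:
        hasher.update(name.encode())  # bind the digest to parameter identity
        hasher.update(raw)
        filled += len(raw)
        if filled >= bucket_size:  # bucket is full: emit its digest
            digests.append(hasher.hexdigest())
            hasher = hashlib.sha256()
            filled = 0
    if filled:  # flush the last, partially filled bucket
        digests.append(hasher.hexdigest())
    return digests
```

Because the digests are computed per bucket, a mismatch pinpoints which bucket was corrupted in transit, and re-running the same update yields identical digest lists, which is what the "two consecutive weight updates" check relies on.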
self.request_update_params(state_dict, train_enable_ep=train_enable_ep, finished=False)
del state_dict, name_list, param_list

if self.rollout_cfg_info["backend"] == "pytorch" and final_update:
Suggested change:

- if self.rollout_cfg_info["backend"] == "pytorch" and final_update:
+ if self.rollout_cfg_info["backend"] in ("pytorch", "vllm") and final_update:
)

@unittest.skipIf(os.environ.get("XTUNER_USE_LMDEPLOY", "0") == "0", "lmdeploy backend is not enabled")
def test_lmdeploy_update_weight_and_generate(self):
This is not a disaggregated case and duplicates the case in tests/rl/test_update_weight.py; it should be removed or force-skipped for now.
from xtuner.v1.utils import ray_method

TEST_TEXT_MESSAGES = [{"role": "user", "content": "Hello!"}]
MODEL_PATH = os.environ.get("MODEL_PATH") or os.environ.get("QWEN3_VL_DENSE_PATH")
`os.environ.get("MODEL_PATH")` should be removed here; just use the original CI env variable.
)

# training config
model_cfg = get_model_config_from_hf(Path(MODEL_PATH))
Maybe use a specific model config here to match QWEN3_VL_DENSE_PATH; see tests/rl/test_update_weight.py.
def setUpClass(cls) -> None:
    if MODEL_PATH is None:
        raise unittest.SkipTest("MODEL_PATH is not set")
    os.environ["XTUNER_USE_FA3"] = "1"
NCCL_CUMEM_ENABLE=0 is required in my test environment; we could add it here, or at the Actor-creation stage via the runtime_env of ray.remote.
NCCL_CUMEM_ENABLE=0 is also required in my test environment. The latest commit has already added this environment variable in the unit tests.
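Both placements mentioned above can be sketched as follows; `TrainWorker` in the commented Ray option is a hypothetical actor name, not the actual test code.

```python
import os

# Option 1: set the variable in the test process before Ray/NCCL
# initialize, so child worker processes inherit it. setdefault avoids
# clobbering a value the CI environment may already provide.
os.environ.setdefault("NCCL_CUMEM_ENABLE", "0")

# Option 2 (sketch): inject it at actor-creation time via Ray's
# runtime_env, so it applies only to the workers that need it:
#
#   import ray
#   worker = TrainWorker.options(  # TrainWorker is a hypothetical actor
#       runtime_env={"env_vars": {"NCCL_CUMEM_ENABLE": "0"}}
#   ).remote(...)
```

Option 1 is simpler for unit tests; Option 2 scopes the setting to specific actors, which is useful when other actors in the same job must keep the NCCL default.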
BTW, please fix the CI lint.
a. Create a gloo group among the training ranks; during disaggregated (train/rollout separated) weight sync, this group is used for barriers.
b. Create an NCCL process group, used by training rank 0 to broadcast the bucketed weights to the SGLang rollout ranks:
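The two groups described in a. and b. can be sketched as follows, assuming torch.distributed and a global rank layout where rollout ranks are numbered after the training ranks; `group_ranks` is a hypothetical helper and the commented calls show roughly how the groups would be used.

```python
def group_ranks(num_train_ranks, num_rollout_ranks):
    """Return (gloo_ranks, nccl_ranks) for the two process groups:
    - gloo_ranks: all training ranks, used for the train-side barrier (a.)
    - nccl_ranks: train rank 0 plus the rollout ranks, used to broadcast
      the bucketed weights from rank 0 to the rollout side (b.)
    Assumes rollout ranks are numbered after the training ranks.
    """
    gloo_ranks = list(range(num_train_ranks))
    nccl_ranks = [0] + list(range(num_train_ranks, num_train_ranks + num_rollout_ranks))
    return gloo_ranks, nccl_ranks

# With torch.distributed initialized, the groups would be created roughly as:
#
#   import torch.distributed as dist
#   gloo_ranks, nccl_ranks = group_ranks(8, 4)
#   gloo_group = dist.new_group(ranks=gloo_ranks, backend="gloo")
#   nccl_group = dist.new_group(ranks=nccl_ranks, backend="nccl")
#   dist.barrier(group=gloo_group)                           # a. train-side barrier
#   dist.broadcast(bucket_tensor, src=0, group=nccl_group)   # b. rank 0 -> rollout
```

Using gloo for the barrier keeps the CPU-side synchronization off the GPU, while the NCCL group carries only the bucketed weight broadcast from training rank 0 to the SGLang rollout ranks.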